Mixture-Modeling with Unsupervised Clusters for Domain Adaptation in Statistical Machine Translation

Author

  • Rico Sennrich (Institute of Computational Linguistics, University of Zurich, Binzmühlestr. 14, CH-8050 Zürich)
Abstract

In Statistical Machine Translation, in-domain and out-of-domain training data are not always clearly delineated. This paper investigates how we can still use mixture-modeling techniques for domain adaptation in such cases. We apply unsupervised clustering methods to split the original training set, and then use mixture-modeling techniques to build a model adapted to a given target domain. We show that this approach improves performance over an unadapted baseline and several alternative domain adaptation methods.

Published version: Sennrich, Rico (2012). Mixture-modeling with unsupervised clusters for domain adaptation in statistical machine translation. In: 16th EAMT Conference, Trento, Italy, 28-29 May 2012, 185-192. Posted at the Zurich Open Repository and Archive (ZORA), University of Zurich: https://doi.org/10.5167/uzh-62826
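As a rough illustration of the approach the abstract describes, the following sketch clusters a mixed-domain training corpus without supervision and then combines per-cluster translation probabilities with weights chosen for a target domain. The toy data, the TF-IDF/k-means clustering, and the cosine-similarity weighting are illustrative assumptions, not the paper's exact components.

```python
from collections import Counter, defaultdict

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy mixed-domain training corpus (source side, with aligned targets)
train_src = [
    "the patient received the drug",
    "the drug reduced the symptoms",
    "the symptoms improved quickly",
    "the parliament adopted the report",
    "the committee discussed the report",
    "the report was adopted yesterday",
]
train_trg = [
    "der patient erhielt das medikament",
    "das medikament linderte die symptome",
    "die symptome besserten sich schnell",
    "das parlament nahm den bericht an",
    "der ausschuss diskutierte den bericht",
    "der bericht wurde gestern angenommen",
]
dev_src = ["the patient received the treatment"]  # small in-domain dev set

# 1) Unsupervised clustering of the training data (TF-IDF + k-means here).
vec = TfidfVectorizer()
X = vec.fit_transform(train_src)
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# 2) Mixture weights: similarity of each cluster centroid to the dev centroid.
centroids = np.vstack([np.asarray(X[labels == c].mean(axis=0)) for c in range(k)])
dev_centroid = np.asarray(vec.transform(dev_src).mean(axis=0))
weights = cosine_similarity(centroids, dev_centroid).ravel()
weights = weights / weights.sum()

# 3) Adapted model: linear interpolation of per-cluster relative frequencies.
counts = [defaultdict(Counter) for _ in range(k)]
for c, s, t in zip(labels, train_src, train_trg):
    counts[c][s][t] += 1  # whole sentences stand in for phrase pairs

def p_adapted(src, trg):
    """Mixture probability p(trg|src) weighted towards the target domain."""
    prob = 0.0
    for c in range(k):
        total = sum(counts[c][src].values())
        if total:
            prob += weights[c] * counts[c][src][trg] / total
    return prob

print(weights)
print(p_adapted("the patient received the drug", "der patient erhielt das medikament"))
```

The same weighting can in principle be applied to language models or any other component estimated per cluster; only the interpolation step changes.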


Similar Articles

A Multi-Domain Translation Model Framework for Statistical Machine Translation

While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, al...
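A minimal sketch of the delayed-computation idea described above: per-domain phrase tables are kept separate, and the mixture feature for a phrase pair is only computed when the decoder requests it, under whatever domain weights are active for the current input. The tables, weights, and feature definition below are toy assumptions, not the framework's actual implementation.

```python
import math

# toy per-domain phrase tables: (source phrase, target phrase) -> p(t|s)
tables = {
    "legal":   {("the court", "das gericht"): 0.7, ("the court", "der hof"): 0.3},
    "medical": {("the court", "der hof"): 0.6, ("the court", "das gericht"): 0.4},
}

def tm_feature(src, trg, weights, floor=1e-6):
    """Log of the weighted mixture p(t|s), computed only at decode time."""
    p = sum(w * tables[d].get((src, trg), 0.0) for d, w in weights.items())
    return math.log(max(p, floor))

# weights can differ per test document without retraining or re-merging tables
print(tm_feature("the court", "das gericht", {"legal": 0.9, "medical": 0.1}))
print(tm_feature("the court", "das gericht", {"legal": 0.1, "medical": 0.9}))
```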


Extracting and Selecting Relevant Corpora for Domain Adaptation in MT

The paper presents a scheme for doing domain adaptation for multiple domains simultaneously. The proposed method segments a large corpus into various parts using self-organizing maps (SOMs). After a SOM is drawn over the documents, an agglomerative clustering algorithm determines how many clusters the text collection comprises. This means that the clustering process is unsupervised, although choi...
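A sketch of the SOM-plus-agglomerative pipeline described above, assuming the third-party minisom package for the map and scikit-learn for the clustering step; the vectorisation, grid size, and distance threshold are illustrative choices rather than the paper's settings.

```python
import numpy as np
from minisom import MiniSom
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the patient was given antibiotics",
    "dosage was reduced after the trial",
    "the parliament adopted the resolution",
    "the committee voted on the amendment",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# 1) Draw a SOM over the documents.
som = MiniSom(3, 3, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, 500)

# 2) Agglomerative clustering over the SOM code vectors determines the number
#    of clusters (n_clusters=None together with a distance threshold).
nodes = som.get_weights().reshape(-1, X.shape[1])
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0).fit(nodes)

# 3) Each document inherits the cluster of its winning SOM node.
def node_index(i, j, grid_cols=3):
    return i * grid_cols + j

doc_clusters = [agg.labels_[node_index(*som.winner(x))] for x in X]
print(doc_clusters, "clusters found:", agg.n_clusters_)
```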


Phrase Training Based Adaptation for Statistical Machine Translation

We present a novel approach for translation model (TM) adaptation using phrase training. The proposed adaptation procedure is initialized with a standard general-domain TM, which is then used to perform phrase training on a smaller in-domain set. This way, we bias the probabilities of the general TM towards the in-domain distribution. Experimental results on two different lectures translation t...
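A minimal sketch of the biasing step described above, assuming the in-domain phrase counts come from some phrase-training pass over the smaller in-domain set (not shown): the general-domain probabilities are interpolated towards the in-domain relative frequencies. The tables and the interpolation weight are toy values, not the paper's procedure.

```python
from collections import Counter

general_tm = {("house", "haus"): 0.8, ("house", "kammer"): 0.2}   # p_gen(t|s)
indomain_counts = Counter({("house", "kammer"): 9, ("house", "haus"): 1})

def adapted_prob(src, trg, lam=0.5):
    """p(t|s) biased towards the in-domain distribution."""
    total = sum(c for (s, _), c in indomain_counts.items() if s == src)
    p_in = indomain_counts[(src, trg)] / total if total else 0.0
    return lam * p_in + (1 - lam) * general_tm.get((src, trg), 0.0)

print(adapted_prob("house", "kammer"))  # pulled up by the in-domain counts
print(adapted_prob("house", "haus"))    # pulled down accordingly
```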


Unsupervised Adaptation for Statistical Machine Translation

In this work, we tackle the problem of language and translation model domain adaptation without explicit bilingual in-domain training data. In such a scenario, the only information about the domain can be induced from the source-language test corpus. We explore unsupervised adaptation, where the source-language test corpus is combined with the corresponding hypotheses generated by the translatio...
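A toy sketch of the self-training style loop the excerpt hints at: the baseline system translates the source-language test corpus, and the resulting source/hypothesis pairs serve as synthetic in-domain data for adaptation. Here only a target-side unigram language model is adapted, as a deliberately simplified stand-in for the paper's actual models.

```python
from collections import Counter

def baseline_translate(sentence):
    # placeholder for the real baseline SMT decoder
    return sentence  # identity "translation" keeps the sketch runnable

test_src = ["the committee adopted the report", "the report was adopted"]
hypotheses = [baseline_translate(s) for s in test_src]

# source sentences paired with their own hypotheses act as pseudo
# in-domain bilingual data (usable for TM or LM adaptation)
synthetic_parallel = list(zip(test_src, hypotheses))

general_counts = Counter("the government adopted a new law".split())
hyp_counts = Counter(w for _, h in synthetic_parallel for w in h.split())

def adapted_unigram(word, lam=0.5):
    """Unigram probability interpolated between general and hypothesis counts."""
    p_gen = general_counts[word] / sum(general_counts.values())
    p_hyp = hyp_counts[word] / sum(hyp_counts.values())
    return lam * p_hyp + (1 - lam) * p_gen

print(adapted_unigram("report"))
```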


Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation

This paper is concerned with exploring efficient domain adaptation for statistical machine translation, based on extracting sentence pairs (pseudo in-domain subcorpora that are most relevant to the in-domain corpora) from a large-scale general-domain bilingual web corpus. These sentences are selected by our proposed unsupervised phrase-based data selection model. Compared ...
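A simplified sketch of pseudo in-domain selection: each general-domain sentence is scored by how much better an in-domain language model likes it than a general-domain model does (a cross-entropy difference criterion in the spirit of Moore-Lewis/Axelrod-style selection), and the top-scoring sentences form the pseudo in-domain subcorpus. This stands in for the paper's phrase-based selection model rather than reproducing it.

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Add-one smoothed unigram model returned as a probability function."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)

in_domain = ["the patient received the drug", "symptoms improved after treatment"]
general   = ["the parliament adopted the law", "the patient felt better",
             "shares fell on monday"]

p_in, p_gen = unigram_lm(in_domain), unigram_lm(general)

def score(sentence):
    """Per-word log-probability gain under the in-domain model (higher = more in-domain-like)."""
    words = sentence.split()
    return sum(math.log(p_in(w)) - math.log(p_gen(w)) for w in words) / len(words)

pseudo_in_domain = sorted(general, key=score, reverse=True)[:2]
print(pseudo_in_domain)
```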


